Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and an IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets, which offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives to consumers and businesses looking to save money on a purchase, and the used device market has plenty of other benefits. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors and platforms, such as Verizon and Amazon, provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps with recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment, as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
# this will help in making the Python code more structured automatically when coding with Jupyter (good coding practice)
#%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression_model
import statsmodels.api as sm
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
project=pd.read_csv("used_device_data.csv")
df=project.copy()
# let's check the shape of the data
df.shape
(3454, 15)
#let's check the first 5 rows of the data
df.head()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 |
#checking the total number of phones sold according to brand
unique_brand=df["brand_name"].value_counts()
print(unique_brand)
Others        502
Samsung       341
Huawei        251
LG            201
Lenovo        171
ZTE           140
Xiaomi        132
Oppo          129
Asus          122
Alcatel       121
Micromax      117
Vivo          117
Honor         116
HTC           110
Nokia         106
Motorola      106
Sony           86
Meizu          62
Gionee         56
Acer           51
XOLO           49
Panasonic      47
Realme         41
Apple          39
Lava           36
Celkon         33
Spice          30
Karbonn        29
Coolpad        22
BlackBerry     22
Microsoft      22
OnePlus        22
Google         15
Infinix        10
Name: brand_name, dtype: int64
#checking the total number of phones sold according to internal memory
unique_int_mem=df["int_memory"].value_counts()
print(unique_int_mem)
16.00      1283
32.00      1083
64.00       509
128.00      372
256.00       86
512.00       44
0.06         18
0.10         17
8.00         12
4.00         10
1024.00       8
0.50          4
0.20          2
24.00         1
0.01          1
Name: int_memory, dtype: int64
#checking the total number of phones sold according to OS
unique_os=df["os"].value_counts()
print(unique_os)
Android    3214
Others      137
Windows      67
iOS          36
Name: os, dtype: int64
#Let's check the number of phones sold according to main camera resolution (MP)
unique_maincam=df["main_camera_mp"].value_counts()
print(unique_maincam)
13.00    1035
8.00      755
5.00      546
12.00     189
2.00      171
16.00     154
3.15      122
0.30       69
10.50      30
1.30       26
23.00      19
21.00      19
20.70      16
20.00      13
14.50      13
12.20      12
4.00       10
12.30       9
19.00       8
8.10        7
13.10       6
10.00       6
24.00       5
3.00        5
12.50       4
48.00       4
6.50        4
6.70        4
21.20       2
21.50       1
1.20        1
16.30       1
22.60       1
18.00       1
12.60       1
20.10       1
41.00       1
20.20       1
1.00        1
0.08        1
22.50       1
Name: main_camera_mp, dtype: int64
#Let's check the number of phones sold according to screen size
unique_screensize=df["screen_size"].value_counts()
print(unique_screensize)
12.70 575
12.83 409
10.29 219
10.16 188
15.24 129
...
17.40 1
15.82 1
14.53 1
16.15 1
14.81 1
Name: screen_size, Length: 142, dtype: int64
#checking the total number of phones sold according to ram
unique_ram=df["ram"].value_counts()
print(unique_ram)
4.00     2815
6.00      154
8.00      130
2.00       90
0.25       83
3.00       81
1.00       34
12.00      18
0.02       18
0.03       17
0.50        9
1.50        1
Name: ram, dtype: int64
# let's check column types and number of values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   brand_name             3454 non-null   object
 1   os                     3454 non-null   object
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object
 4   5g                     3454 non-null   object
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64
 12  days_used              3454 non-null   int64
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
#Checking statistical summary
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3454 | 4 | Android | 3214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3454.00 | NaN | NaN | NaN | 13.71 | 3.81 | 5.08 | 12.70 | 12.83 | 15.34 | 30.71 |
| 4g | 3454 | 2 | yes | 2335 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3454 | 2 | no | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3275.00 | NaN | NaN | NaN | 9.46 | 4.82 | 0.08 | 5.00 | 8.00 | 13.00 | 48.00 |
| selfie_camera_mp | 3452.00 | NaN | NaN | NaN | 6.55 | 6.97 | 0.00 | 2.00 | 5.00 | 8.00 | 32.00 |
| int_memory | 3450.00 | NaN | NaN | NaN | 54.57 | 84.97 | 0.01 | 16.00 | 32.00 | 64.00 | 1024.00 |
| ram | 3450.00 | NaN | NaN | NaN | 4.04 | 1.37 | 0.02 | 4.00 | 4.00 | 4.00 | 12.00 |
| battery | 3448.00 | NaN | NaN | NaN | 3133.40 | 1299.68 | 500.00 | 2100.00 | 3000.00 | 4000.00 | 9720.00 |
| weight | 3447.00 | NaN | NaN | NaN | 182.75 | 88.41 | 69.00 | 142.00 | 160.00 | 185.00 | 855.00 |
| release_year | 3454.00 | NaN | NaN | NaN | 2015.97 | 2.30 | 2013.00 | 2014.00 | 2015.50 | 2018.00 | 2020.00 |
| days_used | 3454.00 | NaN | NaN | NaN | 674.87 | 248.58 | 91.00 | 533.50 | 690.50 | 868.75 | 1094.00 |
| normalized_used_price | 3454.00 | NaN | NaN | NaN | 4.36 | 0.59 | 1.54 | 4.03 | 4.41 | 4.76 | 6.62 |
| normalized_new_price | 3454.00 | NaN | NaN | NaN | 5.23 | 0.68 | 2.90 | 4.79 | 5.25 | 5.67 | 7.85 |
#Check duplicate values
df.duplicated().sum()
0
#checking missing values
df.isnull().sum()
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
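Raw missing-value counts are easier to judge when expressed as percentages of the 3,454 rows (e.g. main_camera_mp is missing for roughly 5% of devices). A minimal sketch on a toy frame, assuming the notebook's column names; in the notebook itself, `toy` would simply be `df`:

```python
import numpy as np
import pandas as pd

# Toy frame with the notebook's column names (values are illustrative)
toy = pd.DataFrame({
    "main_camera_mp": [13.0, np.nan, 8.0, np.nan],
    "ram": [4.0, 8.0, np.nan, 4.0],
})
# isnull().mean() gives the fraction missing per column; scale to percent
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```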
Questions:
Answers (from the EDA done below):
Among brands offering at least an 8MP selfie camera, Honor, Oppo, Samsung, and Lenovo had the largest range of selfie camera resolutions. The median selfie camera resolution across most brands, except Oppo, Coolpad, Acer, Asus, Micromax, Panasonic, and Sony, is 16MP. Acer, Micromax, HTC, and Sony had the lowest median selfie camera resolution, while Oppo has the largest median among all brands.
Screen size in inches, normalized new price, days used, battery, weight, RAM, selfie camera MP, and main camera MP all correlate strongly with normalized used price. Normalized new price carries the highest positive correlation, while days used carries the highest negative correlation.
Defining a custom function to automate plotting of histograms and boxplots for numerical variables.
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a triangle indicates the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram (green dashed line)
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram (black solid line)
Defining a custom function to aid in plotting barplots for analysing the distribution of discrete variables.
def labeled_barplot(df, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    df: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(df[feature])  # length of the column
    count = df[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=df,
        x=feature,
        order=df[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # bar center on the x-axis
        y = p.get_height()  # bar height
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
Phone screen sizes are conventionally quoted in inches, so we add a column with the screen size converted from cm to inches, then visualize it.
# Convert centimeters to inches
df['screen_size_inches'] = df['screen_size'] * 0.393701
histogram_boxplot(df, 'screen_size_inches')
plt.title('Distribution of screen sizes')
plt.xlabel('Screen size' + ' (inches)')
Text(0.5, 0, 'Screen size (inches)')
histogram_boxplot(df, "main_camera_mp")
histogram_boxplot(df, "selfie_camera_mp")
histogram_boxplot(df, "int_memory")
plt.title('Histogram of INT. MEMORY')
plt.xlabel('INT. MEMORY' + ' (GB)')
Text(0.5, 0, 'INT. MEMORY (GB)')
histogram_boxplot(df, "ram")
plt.title('Distribution of RAM size')
plt.xlabel('RAM' + ' (GB)')
Text(0.5, 0, 'RAM (GB)')
histogram_boxplot(df, "battery")
plt.title('Distribution of battery capacity')
plt.xlabel('battery' + ' (mAH)')
Text(0.5, 0, 'battery (mAH)')
histogram_boxplot(df, "weight")
plt.title('Distribution of phone weight')
plt.xlabel('weight' + ' (grams)')
Text(0.5, 0, 'weight (grams)')
histogram_boxplot(df, "days_used")
plt.title('Distribution of days_used')
Text(0.5, 1.0, 'Distribution of days_used')
histogram_boxplot(df, "normalized_used_price")
plt.title('Distribution of normalized_used_price')
plt.xlabel('normalized_used_price' + ' (Euros)')
Text(0.5, 0, 'normalized_used_price (Euros)')
histogram_boxplot(df, "normalized_new_price")
plt.title('Distribution of normalized_new_price')
plt.xlabel('normalized_new_price' + ' (Euros)')
Text(0.5, 0, 'normalized_new_price (Euros)')
labeled_barplot(df, "brand_name", perc=True)
labeled_barplot(df, "os", perc=True)
labeled_barplot(df, "4g", perc=True)
labeled_barplot(df, "5g", perc=True)
A very large proportion of devices (95.6%) do not have 5G. This is expected, given that 5G is a recent technology (launched in 2019) and the dataset includes phones released as far back as 2013. The data should still be reviewed to confirm that the 5G-enabled devices are concentrated in recent release years.
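One way to review this is to cross-tabulate the 5g flag against release_year and check that "yes" appears only from 2019 onward. A minimal sketch on illustrative rows, assuming the dataset's column names (in the notebook, `pd.crosstab(df['release_year'], df['5g'])` would be used directly):

```python
import pandas as pd

# Illustrative rows mimicking the dataset's `5g` and `release_year` columns
toy = pd.DataFrame({
    "5g": ["no", "no", "yes", "no", "yes"],
    "release_year": [2013, 2015, 2020, 2018, 2019],
})
# Count devices by release year and 5G availability
tab = pd.crosstab(toy["release_year"], toy["5g"])
print(tab)
```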
labeled_barplot(df, "release_year", perc=True)
Plotting a correlation heatmap of the numerical variables
# creating a list of numerical columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
# dropping release_year from the list of numerical columns, as it is discrete/ordinal rather than truly continuous
num_cols.remove("release_year")
#plotting correlation heat map
plt.figure(figsize=(12, 7))
sns.heatmap(
    df[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
Observations
Plotting all numerical variables against each other using a scatterplot
sns.pairplot(df, diag_kind="kde")
<seaborn.axisgrid.PairGrid at 0x7fdd28058b20>
plt.figure(figsize=(30, 5))
sns.boxplot(data=df, x="brand_name", y="ram")
plt.show()
plt.figure(figsize=(30, 5))
sns.boxplot(data=df, x="brand_name", y="screen_size_inches")
plt.show()
screen_sizes_over6 = df[df['screen_size_inches'] > 6].shape[0]
print('There are', screen_sizes_over6, 'phones over 6 inches across all brands.')
There are 1228 phones over 6 inches across all brands.
# Create a boxplot
plt.figure(figsize=(30, 5))
sns.boxplot(data=df, x='brand_name', y='selfie_camera_mp');
Plotting the distribution of phones with selfie camera over 8mp and different phone brands.
# Filter the DataFrame for selfie camera resolutions greater than 8MP
selfie_cam_over_8mp_df = df[df['selfie_camera_mp'] > 8]
# Create a boxplot
plt.figure(figsize=(30, 5))
sns.boxplot(data=selfie_cam_over_8mp_df, x='brand_name', y='selfie_camera_mp')
<Axes: xlabel='brand_name', ylabel='selfie_camera_mp'>
# Create a boxplot
plt.figure(figsize=(35, 10))
sns.boxplot(data=df, x='brand_name', y='days_used');
# Create a boxplot
plt.figure(figsize=(35, 10))
sns.boxplot(data=df, x='brand_name', y='normalized_used_price');
# Create a boxplot
plt.figure(figsize=(35, 10))
sns.boxplot(data=df, x='release_year', y='normalized_used_price');
# Create a boxplot
plt.figure(figsize=(35, 10))
sns.boxplot(data=df, x='5g', y='selfie_camera_mp');
# Create a boxplot
plt.figure(figsize=(35, 10))
sns.boxplot(data=df, x='5g', y='main_camera_mp')
plt.ylim(0,30)
(0.0, 30.0)
Observation
# Create a boxplot
plt.figure(figsize=(35, 10))
sns.boxplot(data=df,y='int_memory', x='release_year')
<Axes: xlabel='release_year', ylabel='int_memory'>
# Create a boxplot
plt.figure(figsize=(35, 10))
sns.boxplot(data=df,y='battery', x='os')
<Axes: xlabel='os', ylabel='battery'>
iOS devices have a wider range of battery capacities than the rest; however, the median battery capacity is similar across operating systems.
# Create a boxplot
plt.figure(figsize=(35, 13))
sns.boxplot(data=df,y='battery', x='brand_name')
<Axes: xlabel='brand_name', ylabel='battery'>
# Filter data for devices with battery capacity > 4500 mAh
large_battery_devices = df[df['battery'] > 4500]
# Create a scatterplot to visualize the relationship between weight and battery capacity
plt.figure(figsize=(10, 6))
sns.scatterplot(data=large_battery_devices, x='battery', y='weight', hue='brand_name')
plt.xlabel('Battery Capacity (mAh)')
plt.ylabel('Weight')
plt.title('Weight vs Battery Capacity for Devices with > 4500 mAh')
plt.legend(title='Device Type')
plt.show()
As visualized, phones with battery capacities over 4500 mAh tend to be heavier, with weight increasing as battery capacity increases.
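The visual trend can be quantified with a Pearson correlation on the filtered subset. A minimal sketch on illustrative values, reusing the dataset's battery and weight column names:

```python
import pandas as pd

# Illustrative battery (mAh) / weight (g) pairs for large-battery devices
toy = pd.DataFrame({
    "battery": [4600, 5000, 6000, 7000, 9000],
    "weight": [180, 195, 230, 320, 480],
})
big = toy[toy["battery"] > 4500]  # same filter as the plot above
r = big["battery"].corr(big["weight"])  # Pearson correlation coefficient
print(round(r, 2))
```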
#checking missing values
df.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 179 selfie_camera_mp 2 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 screen_size_inches 0 dtype: int64
We need to create a copy of our dataset to avoid making changes to the original.
df1 = df.copy()
#Imputing missing values with median
df1["main_camera_mp"] = df1["main_camera_mp"].fillna(df1["main_camera_mp"].median())
#checking missing values
df1.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 2 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 screen_size_inches 0 dtype: int64
The missing values in main_camera_mp have been imputed successfully.
#Imputing missing values with median
df1["selfie_camera_mp"] = df1["selfie_camera_mp"].fillna(df1["selfie_camera_mp"].median())
#checking missing values
df1.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 screen_size_inches 0 dtype: int64
#Imputing missing values with median
df1["int_memory"] = df1["int_memory"].fillna(df1["int_memory"].median())
#checking missing values
df1.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 screen_size_inches 0 dtype: int64
#Imputing missing values with median
df1["ram"] = df1["ram"].fillna(df1["ram"].median())
#checking missing values
df1.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 screen_size_inches 0 dtype: int64
df1["battery"] = df1["battery"].fillna(
    value=df1.groupby(["weight"])["battery"].transform("median")
)
df1["battery"] = df1["battery"].fillna(
    value=df1.groupby(["os"])["battery"].transform("median")
)
df1.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 0 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 screen_size_inches 0 dtype: int64
df1["weight"] = df1["weight"].fillna(
    value=df1.groupby(["battery"])["weight"].transform("median")
)
df1["weight"] = df1["weight"].fillna(
    value=df1.groupby(["os"])["weight"].transform("median")
)
df1.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 0 weight 0 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 screen_size_inches 0 dtype: int64
The missing values in weight and all other columns have been handled; we can proceed to the next step.
To further tune the dataset so that it performs well with the linear regression model, the following variables will be transformed:
a. Weight will be converted from grams to kilograms; this helps normalize the variance and mitigates the outlier issue in weight. The original weight column will then be dropped.
b. Screen size will be converted from cm to inches to improve interpretability. This was already done during visualization, so the focus here is on the former.
#Creating copy of data frame
df2=df1.copy()
#create new column for weight in KG
df2['weight_kg']=df1['weight']/1000
#drop weight column
df2.drop(columns=['weight'], inplace=True)
df2.head()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | release_year | days_used | normalized_used_price | normalized_new_price | screen_size_inches | weight_kg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 2020 | 127 | 4.31 | 4.72 | 5.71 | 0.15 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 2020 | 325 | 5.16 | 5.52 | 6.81 | 0.21 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 2020 | 162 | 5.11 | 5.88 | 6.57 | 0.21 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 2020 | 345 | 5.14 | 5.63 | 10.04 | 0.48 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 2020 | 293 | 4.39 | 4.95 | 6.03 | 0.18 |
df2.drop(["screen_size"], axis=1, inplace=True)
df2.head(3)
| brand_name | os | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | release_year | days_used | normalized_used_price | normalized_new_price | screen_size_inches | weight_kg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 2020 | 127 | 4.31 | 4.72 | 5.71 | 0.15 |
| 1 | Honor | Android | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 2020 | 325 | 5.16 | 5.52 | 6.81 | 0.21 |
| 2 | Honor | Android | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 2020 | 162 | 5.11 | 5.88 | 6.57 | 0.21 |
# outlier detection using boxplot
num_cols = df2.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 10))
for i, variable in enumerate(num_cols):
    plt.subplot(3, 5, i + 1)
    sns.boxplot(data=df2, x=variable)
plt.tight_layout(pad=2)
plt.show()
# Winsorization using scipy: clip the bottom 1% and top 5.7% of main_camera_mp values
from scipy.stats.mstats import winsorize
df2["main_camera_mp_winsorized"] = winsorize(df2["main_camera_mp"], limits=[0.01, 0.057])
#dropping os and brand name columns
df2.drop(["os","brand_name"], axis=1, inplace=True)
df2.head()
| 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | release_year | days_used | normalized_used_price | normalized_new_price | screen_size_inches | weight_kg | main_camera_mp_winsorized | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 2020 | 127 | 4.31 | 4.72 | 5.71 | 0.15 | 13.00 |
| 1 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 2020 | 325 | 5.16 | 5.52 | 6.81 | 0.21 | 13.00 |
| 2 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 2020 | 162 | 5.11 | 5.88 | 6.57 | 0.21 | 13.00 |
| 3 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 2020 | 345 | 5.14 | 5.63 | 10.04 | 0.48 | 13.00 |
| 4 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 2020 | 293 | 4.39 | 4.95 | 6.03 | 0.18 | 13.00 |
#visualizing the winsorized main camera mp variable
sns.boxplot(data=df2,x="main_camera_mp_winsorized")
plt.show()
#Converting the yes/no values in the 4g and 5g columns to numeric (1/0)
df2['4g'] = df2['4g'].replace({'yes': 1, 'no': 0}).astype(int)
df2['5g'] = df2['5g'].replace({'yes': 1, 'no': 0}).astype(int)
df2.head()
| 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | release_year | days_used | normalized_used_price | normalized_new_price | screen_size_inches | weight_kg | main_camera_mp_winsorized | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 2020 | 127 | 4.31 | 4.72 | 5.71 | 0.15 | 13.00 |
| 1 | 1 | 1 | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 2020 | 325 | 5.16 | 5.52 | 6.81 | 0.21 | 13.00 |
| 2 | 1 | 1 | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 2020 | 162 | 5.11 | 5.88 | 6.57 | 0.21 | 13.00 |
| 3 | 1 | 1 | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 2020 | 345 | 5.14 | 5.63 | 10.04 | 0.48 | 13.00 |
| 4 | 1 | 0 | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 2020 | 293 | 4.39 | 4.95 | 6.03 | 0.18 | 13.00 |
# defining X and y variables
#Independent variable
X = df2.drop(["normalized_used_price"], axis=1)
#Dependent variable
y = df2["normalized_used_price"]
print(X.head())
print(y.head())
   4g  5g  main_camera_mp  selfie_camera_mp  int_memory  ram  battery  \
0   1   0           13.00              5.00       64.00 3.00  3020.00
1   1   1           13.00             16.00      128.00 8.00  4300.00
2   1   1           13.00              8.00      128.00 8.00  4200.00
3   1   1           13.00              8.00       64.00 6.00  7250.00
4   1   0           13.00              8.00       64.00 3.00  5000.00

   release_year  days_used  normalized_new_price  screen_size_inches  \
0          2020        127                  4.72                5.71
1          2020        325                  5.52                6.81
2          2020        162                  5.88                6.57
3          2020        345                  5.63               10.04
4          2020        293                  4.95                6.03

   weight_kg  main_camera_mp_winsorized
0       0.15                      13.00
1       0.21                      13.00
2       0.21                      13.00
3       0.48                      13.00
4       0.18                      13.00
0   4.31
1   5.16
2   5.11
3   5.14
4   4.39
Name: normalized_used_price, dtype: float64
# let's add the intercept to data
X = sm.add_constant(X)
# splitting the data in 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2417
Number of rows in test data = 1037
df2.head()
| 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | release_year | days_used | normalized_used_price | normalized_new_price | screen_size_inches | weight_kg | main_camera_mp_winsorized | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 2020 | 127 | 4.31 | 4.72 | 5.71 | 0.15 | 13.00 |
| 1 | 1 | 1 | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 2020 | 325 | 5.16 | 5.52 | 6.81 | 0.21 | 13.00 |
| 2 | 1 | 1 | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 2020 | 162 | 5.11 | 5.88 | 6.57 | 0.21 | 13.00 |
| 3 | 1 | 1 | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 2020 | 345 | 5.14 | 5.63 | 10.04 | 0.48 | 13.00 |
| 4 | 1 | 0 | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 2020 | 293 | 4.39 | 4.95 | 6.03 | 0.18 | 13.00 |
Redefining the custom function to plot a boxplot and histogram together for the transformed dataframe.
def histogram_boxplot(df2, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    df2: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=df2, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a triangle indicates the mean value of the column
    sns.histplot(
        data=df2, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=df2, x=feature, kde=kde, ax=ax_hist2
    )  # histogram
    ax_hist2.axvline(
        df2[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram (green dashed line)
    ax_hist2.axvline(
        df2[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram (black solid line)
Redefining the custom labeled barplot helper for use with the transformed dataframe.
import seaborn as sns
import matplotlib.pyplot as plt
def labeled_barplot(df2, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    df2: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(df2[feature])  # length of the column
    count = df2[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=df2,
        x=feature,
        order=df2[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # bar center on the x-axis
        y = p.get_height()  # bar height
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
labeled_barplot(df2, "4g", perc=True)
labeled_barplot(df2, "5g", perc=True)
#plot distribution of main camera mp
histogram_boxplot(df2, "main_camera_mp")
#plot distribution of main camera mp windsorized
histogram_boxplot(df2, "main_camera_mp_winsorized")
#plot distribution of weight_kg
histogram_boxplot(df2, "weight_kg")
#plot distribution of screen size inches
histogram_boxplot(df2, "screen_size_inches")
Next, let's redefine the histogram/boxplot helper and use it to plot the distribution of screen size in cm.
def histogram_boxplot(project, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    project: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=project, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a triangle indicates the mean value of the column
    sns.histplot(
        data=project, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=project, x=feature, kde=kde, ax=ax_hist2
    )  # histogram
    ax_hist2.axvline(
        project[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram (green dashed line)
    ax_hist2.axvline(
        project[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram (black solid line)
#Visualizing for screen size in cm
histogram_boxplot(df, 'screen_size')
plt.title('Distribution of screen sizes in cm')
plt.xlabel('Screen size' + ' (cm)')
Text(0.5, 0, 'Screen size (cm)')
Observation
# creating a list of numerical columns
num_cols_transformed = df2.select_dtypes(include=np.number).columns.tolist()
#Plot correlation heatmap
plt.figure(figsize=(30,10))
sns.heatmap(
    df2[num_cols_transformed].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
#Plotting scatterplots of all the numeric variables
sns.pairplot(df2, diag_kind="kde")
<seaborn.axisgrid.PairGrid at 0x7fdd21623310>
Observation
There is a high correlation between main_camera_mp and main_camera_mp_winsorized, which will introduce multicollinearity into the model; one of them may be dropped later.
There is also a high correlation between release_year and days_used, which might introduce multicollinearity. Since release_year behaves more like a categorical variable, it will likely be dropped after testing for multicollinearity.
selfie_camera_mp is also related to both days_used and release_year; these relationships may introduce multicollinearity into the model as well.
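Before dropping any variables, the suspected multicollinearity can be quantified with variance inflation factors (VIFs) from statsmodels, which the notebook already uses for modeling. A minimal sketch on synthetic columns, where x2 is a near-duplicate of x1 (mirroring main_camera_mp and its winsorized copy); the column names here are illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X_demo = pd.DataFrame({
    "const": 1.0,                                 # intercept column
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),  # near-duplicate of x1
    "x3": rng.normal(size=200),                   # independent predictor
})
# A VIF well above 5-10 for a predictor signals problematic multicollinearity
vifs = {col: variance_inflation_factor(X_demo.values, i)
        for i, col in enumerate(X_demo.columns)}
print(vifs)
```

Here x1 and x2 show very large VIFs while x3 stays near 1; the same check on x_train would identify which of the correlated notebook columns to drop.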
Let's build a linear regression model on the processed data.
#Building regression model with train data
olsmodel = sm.OLS(y_train, x_train).fit()
print(olsmodel.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.843
Model: OLS Adj. R-squared: 0.842
Method: Least Squares F-statistic: 990.4
Date: Fri, 23 Feb 2024 Prob (F-statistic): 0.00
Time: 13:20:59 Log-Likelihood: 107.09
No. Observations: 2417 AIC: -186.2
Df Residuals: 2403 BIC: -105.1
Df Model: 13
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -50.3842 8.896 -5.664 0.000 -67.828 -32.940
4g 0.0354 0.015 2.309 0.021 0.005 0.065
5g -0.0079 0.031 -0.255 0.799 -0.068 0.053
main_camera_mp 0.0053 0.003 1.685 0.092 -0.001 0.011
selfie_camera_mp 0.0138 0.001 12.966 0.000 0.012 0.016
int_memory 0.0001 6.68e-05 1.614 0.107 -2.32e-05 0.000
ram 0.0246 0.005 5.092 0.000 0.015 0.034
battery -1.348e-05 7.02e-06 -1.922 0.055 -2.72e-05 2.76e-07
release_year 0.0256 0.004 5.814 0.000 0.017 0.034
days_used 3.258e-05 3.04e-05 1.072 0.284 -2.7e-05 9.22e-05
normalized_new_price 0.4076 0.011 36.327 0.000 0.386 0.430
screen_size_inches 0.0664 0.008 8.522 0.000 0.051 0.082
weight_kg 0.9628 0.127 7.608 0.000 0.715 1.211
main_camera_mp_winsorized 0.0206 0.004 5.437 0.000 0.013 0.028
==============================================================================
Omnibus: 221.773 Durbin-Watson: 1.915
Prob(Omnibus): 0.000 Jarque-Bera (JB): 416.582
Skew: -0.617 Prob(JB): 3.47e-91
Kurtosis: 4.616 Cond. No. 7.44e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.44e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Let's check the performance of the model using different metrics.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df2_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df2_perf
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsmodel, x_train, y_train)
olsmodel_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.23 | 0.18 | 0.84 | 0.84 | 4.37 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsmodel, x_test, y_test)
olsmodel_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.24 | 0.18 | 0.84 | 0.84 | 4.46 |
The model is able to explain ~84% of the variation in the data.
The train and test RMSE and MAE are low and comparable, so the model is not overfitting.
The MAPE on the test set suggests the model can predict within ~4.46% of the normalized used price, on average.
However, let's address the possible multicollinearity to see if model performance improves.
We will define a custom function to check for multicollinearity.
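Before moving on, a quick sanity check of what a MAPE of ~4.46 means. The sketch below recomputes the same formula used in `mape_score` on toy numbers (not the project data):

```python
import numpy as np

def mape(targets, predictions):
    """Mean absolute percentage error, as defined in mape_score above."""
    targets, predictions = np.asarray(targets, float), np.asarray(predictions, float)
    return np.mean(np.abs(targets - predictions) / targets) * 100

# Predictions off by 5% and 10% -> MAPE is their average
print(mape([100, 200], [105, 180]))  # 7.5
```

So a test MAPE of 4.46 says predictions miss the true normalized used price by about 4.5% on average.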
Checking Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
#Displaying VIF results
checking_vif(x_train)
| | feature | VIF |
|---|---|---|
| 0 | const | 3548711.61 |
| 1 | 4g | 2.30 |
| 2 | 5g | 1.75 |
| 3 | main_camera_mp | 9.84 |
| 4 | selfie_camera_mp | 2.54 |
| 5 | int_memory | 1.25 |
| 6 | ram | 1.99 |
| 7 | battery | 3.80 |
| 8 | release_year | 4.58 |
| 9 | days_used | 2.58 |
| 10 | normalized_new_price | 2.61 |
| 11 | screen_size_inches | 6.22 |
| 12 | weight_kg | 5.76 |
| 13 | main_camera_mp_winsorized | 11.21 |
Managing Multicollinearity
To remove multicollinearity, I will use a custom-defined function that checks the effect of dropping each high-VIF column on model performance.
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store adj. R-squared and RMSE values
adj_r2 = []
rmse = []
# build ols models by dropping one of the high VIF columns at a time
# store the adjusted R-squared and RMSE in the lists defined previously
for cols in high_vif_columns:
# defining the new train set
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# create the model
olsmodel = sm.OLS(target, train).fit()
# adding adj. R-squared and RMSE to the lists
adj_r2.append(olsmodel.rsquared_adj)
rmse.append(np.sqrt(olsmodel.mse_resid))
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
# Checking the impact on model performance when either of the variables is dropped
col_list = ["main_camera_mp", "main_camera_mp_winsorized"]
res = treating_multicollinearity(x_train, y_train, col_list)
res
| | col | Adj. R-squared after_dropping col | RMSE after dropping col |
|---|---|---|---|
| 0 | main_camera_mp_winsorized | 0.84 | 0.23 |
| 1 | main_camera_mp | 0.83 | 0.24 |
# Dropping the variable introducing multicollinearity and renaming the train and test predictor sets
col_to_drop = "main_camera_mp_winsorized"
x_train2 = x_train.drop(col_to_drop, axis=1)
x_test2 = x_test.drop(col_to_drop, axis=1)
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping main_camera_mp_winsorized
| | feature | VIF |
|---|---|---|
| 0 | const | 3546678.48 |
| 1 | 4g | 2.27 |
| 2 | 5g | 1.74 |
| 3 | main_camera_mp | 1.90 |
| 4 | selfie_camera_mp | 2.54 |
| 5 | int_memory | 1.25 |
| 6 | ram | 1.99 |
| 7 | battery | 3.79 |
| 8 | release_year | 4.58 |
| 9 | days_used | 2.57 |
| 10 | normalized_new_price | 2.53 |
| 11 | screen_size_inches | 6.16 |
| 12 | weight_kg | 5.56 |
Now that multicollinearity has been addressed, let's run the regression again.
# Rerunning the model on the reduced feature set
olsmodel1 = sm.OLS(y_train, x_train2).fit()
print(olsmodel1.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.841
Model: OLS Adj. R-squared: 0.840
Method: Least Squares F-statistic: 1058.
Date: Fri, 23 Feb 2024 Prob (F-statistic): 0.00
Time: 13:43:12 Log-Likelihood: 92.316
No. Observations: 2417 AIC: -158.6
Df Residuals: 2404 BIC: -83.36
Df Model: 12
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -51.5418 8.946 -5.762 0.000 -69.084 -34.000
4g 0.0461 0.015 3.016 0.003 0.016 0.076
5g -0.0197 0.031 -0.634 0.526 -0.080 0.041
main_camera_mp 0.0207 0.001 14.854 0.000 0.018 0.023
selfie_camera_mp 0.0139 0.001 12.974 0.000 0.012 0.016
int_memory 8.579e-05 6.7e-05 1.280 0.201 -4.57e-05 0.000
ram 0.0240 0.005 4.939 0.000 0.014 0.033
battery -1.125e-05 7.05e-06 -1.596 0.111 -2.51e-05 2.57e-06
release_year 0.0262 0.004 5.908 0.000 0.017 0.035
days_used 4.178e-05 3.05e-05 1.369 0.171 -1.81e-05 0.000
normalized_new_price 0.4184 0.011 37.668 0.000 0.397 0.440
screen_size_inches 0.0705 0.008 9.039 0.000 0.055 0.086
weight_kg 0.8363 0.125 6.683 0.000 0.591 1.082
==============================================================================
Omnibus: 220.850 Durbin-Watson: 1.905
Prob(Omnibus): 0.000 Jarque-Bera (JB): 397.334
Skew: -0.629 Prob(JB): 5.25e-87
Kurtosis: 4.537 Cond. No. 7.43e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.43e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Having dealt with multicollinearity, let's turn our attention to eliminating insignificant variables with high p-values.
# initial list of columns
predictors = x_train2.copy()
cols = predictors.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
x_train_aux = predictors[cols]
# fitting the model
model = sm.OLS(y_train, x_train_aux).fit()
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
['const', '4g', 'main_camera_mp', 'selfie_camera_mp', 'ram', 'release_year', 'normalized_new_price', 'screen_size_inches', 'weight_kg']
Renaming the datasets after p-value-based feature selection (high p-value columns removed)
# Renaming training and test data set with newly optimized values
x_train3 = x_train2[selected_features]
x_test3 = x_test2[selected_features]
# Re-running model with newly optimized data
olsmodel2 = sm.OLS(y_train, x_train3).fit()
print(olsmodel2.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.840
Model: OLS Adj. R-squared: 0.840
Method: Least Squares F-statistic: 1584.
Date: Fri, 23 Feb 2024 Prob (F-statistic): 0.00
Time: 14:15:26 Log-Likelihood: 89.094
No. Observations: 2417 AIC: -160.2
Df Residuals: 2408 BIC: -108.1
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -41.7940 6.954 -6.010 0.000 -55.431 -28.157
4g 0.0442 0.015 2.964 0.003 0.015 0.073
main_camera_mp 0.0208 0.001 15.500 0.000 0.018 0.023
selfie_camera_mp 0.0140 0.001 13.224 0.000 0.012 0.016
ram 0.0220 0.004 5.069 0.000 0.013 0.030
release_year 0.0214 0.003 6.190 0.000 0.015 0.028
normalized_new_price 0.4187 0.011 39.001 0.000 0.398 0.440
screen_size_inches 0.0668 0.007 8.912 0.000 0.052 0.082
weight_kg 0.7739 0.118 6.571 0.000 0.543 1.005
==============================================================================
Omnibus: 222.363 Durbin-Watson: 1.906
Prob(Omnibus): 0.000 Jarque-Bera (JB): 408.444
Skew: -0.626 Prob(JB): 2.03e-89
Kurtosis: 4.578 Cond. No. 2.95e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.95e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsmodel2, x_train3, y_train)
olsmodel_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.23 | 0.18 | 0.84 | 0.84 | 4.40 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsmodel2, x_test3, y_test)
olsmodel_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.24 | 0.18 | 0.84 | 0.84 | 4.46 |
In order to make statistical inferences from a linear regression model, it is important to ensure that the assumptions of linear regression are satisfied.
The test for multicollinearity has been done and handled earlier, so we will not repeat it in this section.
To check for linearity and independence, we will do the following:
# let us create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmodel2.fittedvalues # predicted values
df_pred["Residuals"] = olsmodel2.resid # residuals
df_pred.head()
| | Actual Values | Fitted Values | Residuals |
|---|---|---|---|
| 3026 | 4.09 | 3.86 | 0.23 |
| 1525 | 4.45 | 4.59 | -0.15 |
| 1128 | 4.32 | 4.29 | 0.03 |
| 3003 | 4.28 | 4.25 | 0.03 |
| 2907 | 4.46 | 4.48 | -0.02 |
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
Observations
We will check the normality of residuals as follows.
It can be checked via a Q-Q plot of residuals: if the residuals follow a normal distribution, the points will fall along a straight line; otherwise they will not.
As an additional test, we will check for normality using the Shapiro-Wilk test, interpreted as follows:
Null hypothesis: the residuals are normally distributed.
Alternate hypothesis: the residuals are not normally distributed.
A p-value greater than 0.05 means we fail to reject the null hypothesis.
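To illustrate the decision rule before applying it to the residuals, here is a small sketch on synthetic data (hypothetical samples, not the project residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=500)        # drawn from a normal distribution
skewed_sample = rng.exponential(size=500)   # clearly non-normal (right-skewed)

_, p_normal = stats.shapiro(normal_sample)
_, p_skewed = stats.shapiro(skewed_sample)

# For the skewed sample we reject normality (p < 0.05);
# for the truly normal sample we typically fail to reject.
print(p_normal, p_skewed)
```

Note that with large samples the test flags even small departures from normality, which is worth keeping in mind when reading the result below.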
#Plotting the distribution of residuals
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
import pylab
import scipy.stats as stats
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
# Running the Shapiro-Wilk test
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.969733476638794, pvalue=3.583213733486345e-22)
The p-value is less than 0.05, which means the distribution of residuals deviates from normal in the statistical sense. However, the histogram and Q-Q plot show the deviation is mild, and with ~2,400 observations the Shapiro-Wilk test flags even small departures, so we will treat the normality assumption as approximately satisfied.
The Goldfeld-Quandt test can be used to check for homoscedasticity.
If we get a p-value > 0.05, we can say that the residuals are homoscedastic.
Otherwise, they are heteroscedastic.
Null hypothesis: Residuals are homoscedastic
Alternate hypothesis: Residuals have heteroscedasticity
#Importing required library for test
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
#Creating temp variable to run test
name = ["F statistic", "p-value"]
#Running test and printing result
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train3)
lzip(name, test)
[('F statistic', 1.0536786282386303), ('p-value', 0.18267222360599775)]
Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.
Recreating the final model and printing results to gain more insights
# Assigning a name to the final optimized data for the model
x_train_final = x_train3.copy()
x_test_final = x_test3.copy()
#Running final model
olsmodel_final = sm.OLS(y_train, x_train_final).fit()
print(olsmodel_final.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.840
Model: OLS Adj. R-squared: 0.840
Method: Least Squares F-statistic: 1584.
Date: Fri, 23 Feb 2024 Prob (F-statistic): 0.00
Time: 15:05:21 Log-Likelihood: 89.094
No. Observations: 2417 AIC: -160.2
Df Residuals: 2408 BIC: -108.1
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -41.7940 6.954 -6.010 0.000 -55.431 -28.157
4g 0.0442 0.015 2.964 0.003 0.015 0.073
main_camera_mp 0.0208 0.001 15.500 0.000 0.018 0.023
selfie_camera_mp 0.0140 0.001 13.224 0.000 0.012 0.016
ram 0.0220 0.004 5.069 0.000 0.013 0.030
release_year 0.0214 0.003 6.190 0.000 0.015 0.028
normalized_new_price 0.4187 0.011 39.001 0.000 0.398 0.440
screen_size_inches 0.0668 0.007 8.912 0.000 0.052 0.082
weight_kg 0.7739 0.118 6.571 0.000 0.543 1.005
==============================================================================
Omnibus: 222.363 Durbin-Watson: 1.906
Prob(Omnibus): 0.000 Jarque-Bera (JB): 408.444
Skew: -0.626 Prob(JB): 2.03e-89
Kurtosis: 4.578 Cond. No. 2.95e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.95e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Next, check model performance on the training and test data for overfitting and underfitting.
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(
olsmodel_final, x_train_final, y_train
)
olsmodel_final_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.23 | 0.18 | 0.84 | 0.84 | 4.40 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(
olsmodel_final, x_test_final, y_test
)
olsmodel_final_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.24 | 0.18 | 0.84 | 0.84 | 4.46 |
The model is able to explain ~84% of the variation in the data.
The train and test RMSE and MAE are low and comparable, so the model is not overfitting.
The MAPE on the test set suggests the model can predict within ~4.46% of the normalized used price, on average.
# predictions on the test set
pred = olsmodel_final.predict(x_test_final)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(20, random_state=1)
| | Actual | Predicted |
|---|---|---|
| 1995 | 4.57 | 4.38 |
| 2341 | 3.70 | 3.96 |
| 1913 | 3.59 | 3.71 |
| 688 | 4.31 | 4.11 |
| 650 | 4.52 | 5.13 |
| 2291 | 4.26 | 4.39 |
| 40 | 5.00 | 5.48 |
| 1884 | 3.88 | 4.07 |
| 2538 | 4.21 | 4.04 |
| 45 | 5.38 | 5.31 |
| 553 | 3.22 | 3.40 |
| 3229 | 4.65 | 4.68 |
| 1892 | 4.81 | 4.56 |
| 1696 | 3.91 | 4.07 |
| 3254 | 5.18 | 5.37 |
| 3143 | 4.84 | 4.76 |
| 136 | 4.02 | 4.02 |
| 3187 | 4.87 | 5.09 |
| 1754 | 3.68 | 3.54 |
| 1047 | 4.34 | 4.81 |
The model explains ~84% of the variation in the data and predicts within ~4.46% of the normalized used price on average, which is quite good. This indicates that the model is useful for both prediction and inference.
Holding all other variables constant:
A 1-unit increase in 4g (i.e., 4G availability) increases the normalized used price by ~0.044.
A 1 MP increase in main_camera_mp increases the normalized used price by ~0.021.
A 1 MP increase in selfie_camera_mp increases the normalized used price by ~0.014.
A 1-unit increase in ram increases the normalized used price by ~0.022.
A 1-year increase in release_year increases the normalized used price by ~0.021.
A 1-unit increase in normalized_new_price increases the normalized used price by ~0.419.
A 1-inch increase in screen_size_inches increases the normalized used price by ~0.067.
A 1 kg increase in weight_kg increases the normalized used price by ~0.774.
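These per-unit effects fall directly out of the linear form of the model. A sketch of how a one-unit change in a predictor moves the prediction, using the coefficients copied from the final OLS summary above (the example feature values are hypothetical):

```python
# Coefficients from the final OLS summary (copied for illustration)
coefs = {
    "const": -41.7940, "4g": 0.0442, "main_camera_mp": 0.0208,
    "selfie_camera_mp": 0.0140, "ram": 0.0220, "release_year": 0.0214,
    "normalized_new_price": 0.4187, "screen_size_inches": 0.0668,
    "weight_kg": 0.7739,
}

def predict(row):
    """Linear prediction: intercept + sum of coefficient * feature value."""
    return coefs["const"] + sum(coefs[k] * v for k, v in row.items())

# Hypothetical device; only the structure matters here
base = {"4g": 1, "main_camera_mp": 8, "selfie_camera_mp": 5, "ram": 4,
        "release_year": 2018, "normalized_new_price": 5.0,
        "screen_size_inches": 6.0, "weight_kg": 0.16}
bumped = dict(base, ram=5)  # one extra unit of RAM

delta = predict(bumped) - predict(base)
print(round(delta, 4))  # equals the ram coefficient, 0.022
```

In a linear model with no interaction terms, the change in prediction from a one-unit bump in any feature is exactly that feature's coefficient, which is what the interpretations above rely on.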
The data on 5G phones needs to be rechecked for correctness, since over 90% of phones are flagged as having 5G, yet more than 50% of the phones in the data set were released before 2019 (the year 5G launched).
Phones with 5-inch screens were the most sold, so these phones should be prioritized for sale.